
Mark HTTP/1.1 async connection as not open (non-reusable) as soon as it becomes closed by the opposite endpoint #543


Closed

Conversation

ok2c
Member

@ok2c ok2c commented Aug 7, 2025

@rschmitt This should reduce the window of the stale connection lease race condition somewhat.

@ok2c ok2c requested a review from rschmitt August 7, 2025 16:17
@@ -277,13 +278,14 @@ public final void onInput(final ByteBuffer src) throws HttpException, IOExceptio
return;
}

boolean endOfStream = false;
if (incomingMessage == null) {
final int bytesRead = inbuf.fill(ioSession);
Contributor

I think there's a key issue right here: we can't read bytes _and_ endOfStream in the same operation. So even if the closure of the connection is already known to the host (at some level of abstraction), this code doesn't learn about it until the next event loop iteration, by which point the connection has already been returned to the pool and potentially reused.

I guess one question here is: should we read in a loop until bytesRead comes back as either 0 or -1? That would be a kind of edge-triggered approach, where you drain the file descriptor until read() reports would-block (EAGAIN). If possible, I think that approach, combined with this change set, would greatly reduce or eliminate purely internal race conditions in the client's connection management.
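For illustration, the drain-until-would-block idea can be sketched with plain java.nio, with no HttpCore types involved (`drain` is a hypothetical helper, and an in-memory channel stands in for the socket): reading in a loop lets a single pass observe both the payload and the end of stream, instead of deferring the EOF to the next event-loop iteration.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class DrainLoop {

    // Keep reading until the channel reports "no data right now" (0) or
    // end of stream (-1), so both the payload and the EOF are seen in a
    // single pass rather than on the next event-loop iteration.
    static int[] drain(final ReadableByteChannel ch, final ByteBuffer buf) throws Exception {
        int total = 0;
        int n;
        while ((n = ch.read(buf)) > 0) {
            total += n;
            buf.clear(); // pretend the caller consumed the bytes
        }
        return new int[] { total, n }; // n is 0 (would block) or -1 (EOF)
    }

    public static void main(final String[] args) throws Exception {
        final ReadableByteChannel ch =
                Channels.newChannel(new ByteArrayInputStream(new byte[] { 1, 2, 3, 4, 5 }));
        final int[] r = drain(ch, ByteBuffer.allocate(2));
        System.out.println("drained=" + r[0] + " last=" + r[1]); // drained=5 last=-1
    }
}
```

With a real non-blocking socket the loop would terminate with 0 when the descriptor has no more data, and with -1 only once the peer's FIN has actually been delivered; the sketch only demonstrates the control flow.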

Member Author

> I guess one question here is: should we read in a loop until bytesRead comes back as either 0 or -1?

@rschmitt The protocol handler should be reading incoming data greedily until there is no more data to be read. I will review the code tomorrow to make sure it is really so.

Member Author

@rschmitt You are right. The protocol handler reads message body content greedily but stops at the end of the message and waits for the next input event. I am looking into it.

Member Author

@rschmitt Please test the updated change-set in your test environment and let me know if it improves the situation.

Contributor

I ran my tests against your change set. Here are my findings:

  1. Uncontended connection reuse is moderately more reliable with HTTP, but not with HTTPS.
  2. Contended connection reuse is the same, with failure rates of 67% for small batches and 50% for large batches.

I've pushed my reproducer here so you can experiment with it. One thing I quickly discovered is that your change set is ineffective on HTTPS because SSLIOSession::read is just returning the same cached value every time:

    @Override
    public int read(final ByteBuffer dst) {
        return endOfStream ? -1 : 0;
    }

This is altogether different from HTTP, where I've confirmed that the duplexer reads endOfStream on the second call to fillBuffer().

To me, the biggest mystery is why contended connection reuse is failing at such a consistently high rate:

Core version: 5.4-alpha1-SNAPSHOT
Client version: 5.6-alpha1-SNAPSHOT

Http: Sequential requests (rapid): 2,496 succeeded; 4 failed (99.84% success rate)
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Http: Single large batch: 15 succeeded; 15 failed (50.00% success rate)
Http: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate)

Https: Sequential requests (rapid): 2,235 succeeded; 265 failed (89.40% success rate)
Https: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
Https: Single large batch: 15 succeeded; 15 failed (50.00% success rate)
Https: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate)

One fascinating detail is that your change set caused the HTTP batch tests to see RequestNotExecutedException, where previously they saw ConnectionClosedException. Before:

Http: Sequential requests (rapid): 2,479 succeeded; 21 failed (99.16% success rate)
  14 not executed, 6 closed, 0 reset, 1 other
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
  0 not executed, 0 closed, 0 reset, 0 other
Http: Single large batch: 15 succeeded; 15 failed (50.00% success rate)
  0 not executed, 15 closed, 0 reset, 0 other
Http: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate)
  0 not executed, 5 closed, 0 reset, 0 other

After:

Http: Sequential requests (rapid): 2,491 succeeded; 9 failed (99.64% success rate)
  9 not executed, 0 closed, 0 reset, 0 other
Http: Sequential requests (slow): 10 succeeded; 0 failed (100.00% success rate)
  0 not executed, 0 closed, 0 reset, 0 other
Http: Single large batch: 15 succeeded; 15 failed (50.00% success rate)
  15 not executed, 0 closed, 0 reset, 0 other
Http: Multiple small batches: 10 succeeded; 5 failed (66.67% success rate)
  5 not executed, 0 closed, 0 reset, 0 other

This must mean that the connection is being returned to the pool the very instant the HTTP response is read, before the second fillBuffer() call takes place (or it means that the new endOfStream check is ineffective).

Member Author

> This must mean that the connection is being returned to the pool the very instant the HTTP response is read

@rschmitt This is precisely what the protocol handler does. As soon as the response message has been fully consumed and the connection is deemed reusable, the protocol handler hands it back to the connection pool, which may immediately lease it out to another caller. No amount of trickery can make the protocol handler aware of the opposite endpoint's intent not to honor the protocol requirements with regard to connection persistence, especially when TLS is used to negotiate the connection closure. It can get lucky and read the end of stream over a plain connection, but that is just luck.

There is no solution to this problem other than a request recovery and retry mechanism, and possibly some sort of stale connection check. I suggest we focus on that.
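The recovery-and-retry mechanism could be sketched in plain Java. This is a hypothetical stand-alone sketch, not the HttpClient API: `NotExecutedException` stands in for the client's retryable `RequestNotExecutedException`, and `executeWithRetry` is an invented helper.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    // Marker for a request that failed before any bytes were written to
    // the wire, and is therefore always safe to re-execute (analogous to
    // RequestNotExecutedException thrown on a stale pooled connection).
    static class NotExecutedException extends IOException {
    }

    // Re-execute the request whenever it fails with the safely retryable
    // marker exception, up to maxRetries additional attempts.
    static <T> T executeWithRetry(final Callable<T> request, final int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return request.call();
            } catch (final NotExecutedException ex) {
                if (attempt >= maxRetries) {
                    throw ex; // out of attempts; propagate the failure
                }
                // Stale connection was leased; discard it and try again.
            }
        }
    }

    public static void main(final String[] args) throws Exception {
        final int[] calls = { 0 };
        final String result = executeWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new NotExecutedException(); // first two leases are stale
            }
            return "200 OK";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts"); // 200 OK after 3 attempts
    }
}
```

The design point is that only failures known to have occurred before the request was sent are retried unconditionally; anything that may have reached the server needs idempotency considerations first.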

Contributor

Your changes in this PR significantly improved things by causing RequestNotExecutedException (a safely retryable exception) to be thrown much more consistently in this case. I was hoping we could figure out how to do the same thing in the case of TLS.

Member Author

> I was hoping we could figure out how to do the same thing in the case of TLS.

@rschmitt I do not think this can be done, not without a major rewrite of the TLS layer. Presently the TLS layer and the HTTP protocol layer are decoupled, and the protocol layer interacts with the lower transport layer through a generic IOSession interface. I do not see a way of reliably running the TLS close_notify handshake from the protocol handler.

I really think we ought to focus on #547 or something similar (such as using OPTIONS as a poor man's version of H2 ping)

Contributor

OK, fair enough. In that case, I'm indifferent as to whether we ship the present PR. Plaintext HTTP is almost irrelevant today.

I do have some unfinished business involving a bug in UDS connection closure with CloseMode.GRACEFUL, and I think it might be related to the things we're talking about in this thread. I'll try to get back to that this week.

@ok2c ok2c force-pushed the mark_half_closed_async_http1_as_not_open branch from df89f06 to 24ccaad Compare August 8, 2025 09:05
@ok2c ok2c force-pushed the mark_half_closed_async_http1_as_not_open branch from 24ccaad to dfa2cd5 Compare August 8, 2025 10:40
@ok2c ok2c closed this Aug 9, 2025
@ok2c ok2c deleted the mark_half_closed_async_http1_as_not_open branch August 15, 2025 18:54